Transcribing Names with Foreign Origin in the Onomastica Project

Author

Joakim Gustafson

Abstract

This paper studies the problem of transcribing foreign names. The transcriptions of first names in five languages are examined to show how this problem has been handled in the Onomastica Multi-Lingual Pronunciation Dictionary of European names. The paper describes this dictionary and the methods used to produce the automatic transcriptions for the Swedish part.

INTRODUCTION

Names differ from ordinary words in morphology and phonology. For this reason, the normal letter-to-sound rules used in general text-to-speech systems are inadequate for transcribing proper names. To deal with the name pronunciation problem, name transcription rules and a name dictionary have to be developed. The objective of the Onomastica project is to produce such rules and a dictionary of European names that will be published on a CD-ROM. This paper presents the problems encountered in the work on this project and how they have been solved. The transcriptions of first names in five languages are examined to illustrate the problem. The Swedish name transcription system is presented as well.

THE ONOMASTICA DATABASE

The objective of the ONOMASTICA project, funded by the LRE programme, is to build a quality-controlled, multilingual pronunciation dictionary of proper names in Europe. The project covers eleven languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish. Transcriptions of up to 1,000,000 names per language will be produced semi-automatically. The ultimate pronunciation dictionary should include a carefully verified transcription of each name, but because resources are limited, only a subset of the name list can be transcribed and verified manually.

The names are transcribed in three quality bands. The first band includes transcriptions judged to be correct for some owners of the name. The second band gives transcriptions that are acceptable to a native speaker/listener. The third band contains names that have been transcribed automatically, without manual checking.

The names in bands I and II were chosen according to their frequency in the telephone directory, so that a cumulative coverage of at least 80% was obtained. From the Swedish database, described in Table 1, the names that occurred more than five times were selected for transcription in band I, giving a cumulative coverage ranging from close to 95% for surnames to 100% for town names (almost all places have more than five subscribers).

Table 1. The Swedish Name Database

Name category    # of names    Names with frequency >5
Surnames         228048        46859
Place names      6373          6120
Titles           27055         5370
Street names     65196         39822
First names      6085
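The frequency-based selection described above can be illustrated with a minimal sketch. It assumes the directory is available as a mapping from name to number of occurrences; the function names and the toy counts are hypothetical and are not taken from the Onomastica data or tools.

```python
def cumulative_coverage(freqs, min_count):
    """Share of all directory entries whose name occurs more than min_count times."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for count in freqs.values() if count > min_count)
    return covered / total


def select_names(freqs, min_count):
    """Names occurring more than min_count times, most frequent first."""
    chosen = [name for name, count in freqs.items() if count > min_count]
    return sorted(chosen, key=lambda name: (-freqs[name], name))


def names_for_coverage(freqs, target):
    """Smallest most-frequent-first prefix of names reaching the target coverage."""
    total = sum(freqs.values())
    ranked = sorted(freqs, key=lambda name: (-freqs[name], name))
    picked, covered = [], 0
    for name in ranked:
        if total and covered / total >= target:
            break
        picked.append(name)
        covered += freqs[name]
    return picked


# Hypothetical toy counts, for illustration only.
toy_counts = {
    "Andersson": 50,
    "Johansson": 40,
    "Lindqvist": 6,
    "Gustafson": 5,
    "Zetterholm": 1,
}

band_one = select_names(toy_counts, min_count=5)
coverage = cumulative_coverage(toy_counts, min_count=5)
eighty_percent = names_for_coverage(toy_counts, target=0.80)
```

With these toy counts, the "more than five occurrences" threshold keeps three of the five names while covering 96 of the 102 entries, which shows how a small set of frequent names can account for most of a directory.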

Similar resources

Using two-level morphology to transcribe Swedish names

Names are difficult to handle for normal letter-to-sound rules, since these are usually designed for ordinary words. The structure of Swedish names differs from that of ordinary words, but their multi-morphemic structure makes them suitable for analysis with a morphological analyser. The paper presents the work on names from the Swedish telephone directory, as part of the ONOMASTICA project [7], including a...


Issues in the pronunciation of proper names: the experience of the Onomastica project

This paper discusses several relevant issues concerning the pronunciation of proper names. Although it was motivated by the experience of the ONOMASTICA European project, it reflects a personal view and therefore does not constitute an official document of the project. The major outcome of the project was the production of two important linguistic resources: the set of 11 national pronunciation lexica...


Phonetic Transcription Standards for European Names

exchanging national names amongst the partners to create a matrix of 'nativised' pronunciations for each (thereby) foreign name in each other language. This paper details the standards identified for phonetic transcription of names as part of the ONOMASTICA project, a European-wide research initiative for the construction of a multi-language pronunciation lexicon of proper names. The main desig...


The Onomastica Interlanguage Pronunciation Lexicon

This paper presents one of the linguistic resources developed in the scope of the ONOMASTICA European project. In terms of size, the interlanguage pronunciation lexicon represents a very small fraction of the global lexicon produced by the project. The interest of this research tool, however, derives from its particular contents: 1,000 names from each of the 11 languages represented in the con...
